Thera Bank management wants to explore ways of converting its liability customers to personal loan customers. A previous campaign had an observed success rate of 9%.
A dataset has been provided with data on 5000 customers. The data include customer demographic information (age, income, etc.), the customer's relationship with the bank (mortgage, securities account, etc.), and the customer response to the last personal loan campaign (Personal Loan).
The classification goal is to predict the likelihood of a liability customer buying personal loans.
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import pandas_profiling
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', -1)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, recall_score, precision_score, f1_score, roc_auc_score, accuracy_score, roc_curve, auc
import statsmodels.api as sm
from scipy import stats
from yellowbrick.classifier import ClassificationReport, ROCAUC
df_LoanModelling = pd.read_csv('Bank_Personal_Loan_Modelling.csv')
df_LoanModelling.head()
df_LoanModelling.info()
The data is all numeric (interpreted as integer or float). Some columns have been interpreted as continuous but are categorical - this will require correction before regression (below).
df_LoanModelling.describe().transpose()
The column 'ID' appears to be an index, as opposed to a customer identifier. The columns 'Income' and 'Mortgage' both appear right skewed. The mean income is 73,774, and the mean mortgage is 56,498.
The 'Experience' column has a minimum value of -3. This column is described in the brief as "years of professional experience", so negative values are invalid. The rows with negative values will either need to be dropped or have new values imputed (below).
The mean age of customers in this dataset is 45 years, and the distribution appears normal.
pandas_profiling.ProfileReport(df_LoanModelling)
The 'ID' column contains 5000 unique values in the range 1 to 5000, so this is an index and will be dropped (below). The correlation plot shows that 'Age' and 'Experience' are extremely highly correlated - this may be useful for imputing new values for 'Experience' to replace the invalid negative values noted in the summary above. The plots for 'Income', 'CCAvg', and 'Mortgage' show all are right skewed.
It was noted in the project brief that the uptake rate of personal loan in the last campaign was 9%, so this means the dependent variable 'Personal Loan' is imbalanced. There appear to be equivalently low rates of usage of Securities Accounts and CD Accounts.
In terms of correlations with the target variable 'Personal Loan', the Pearson's correlation matrix shows there appear to be correlations with 'Income', 'CCAvg', and 'CD Account'.
print(df_LoanModelling.columns)
print(df_LoanModelling.shape)
The dataset contains 5000 rows, which is the expected number as stated in the project brief.
The column 'Personal Loan' contains boolean data, which is whether (i.e. Yes/No) the customer accepted a personal loan in the last campaign. As the objective of this project is to predict the likelihood of a customer buying a personal loan in a future campaign, 'Personal Loan' is the dependent variable for this analysis. There are 12 independent variables and 1 index.
df_LoanModelling.isnull().sum()
There are no null values. However, as per the observations at 1.3 above, the min value of 'Experience' is -3, which indicates errors in the data.
df_NegativeExperience = df_LoanModelling[df_LoanModelling['Experience'] < 0]
print(df_NegativeExperience.shape)
print(df_NegativeExperience['Experience'].value_counts())
There are 52 rows with a negative 'Experience' value, and the negative values are in the range -3 to -1.
For the purposes of this analysis, we'll assume we don't know why the negatives exist - these could be data entry errors, but there could be some other issue.
With this assumption, one option is to simply drop the rows with the negative Experience values, as we know this data is invalid. Firstly, we should investigate whether it's possible to identify a number to impute in place of the negative.
One option is to simply replace the negative values with the mean, median, or mode depending on the type of variable and the distribution of the data, and 'Experience' looks to be normally distributed.
plt.figure(figsize=(12,6))
plt.axvline(df_LoanModelling['Experience'].mean(),color='green',label='Mean')
plt.axvline(df_LoanModelling['Experience'].median(),color='blue',label='Median')
sns.distplot(df_LoanModelling['Experience']);
However, the profiling report at 1.3.1 above shows that 'Experience' is highly correlated with 'Age' in this dataset.
sns.jointplot(df_LoanModelling['Age'], df_LoanModelling['Experience']);
This suggests a more thorough approach that preserves the correlation in the data is not to use the mean for the whole distribution as the imputed value, but to use a linear regression model to predict an Experience value for Age, and use this to impute a value for Experience using the Age for the rows that currently have the invalid negative values for Experience.
# As negative values for Experience are errors, exclude them from the regression model
df_NonNegExp = df_LoanModelling.copy()
indexes = df_NonNegExp[df_NonNegExp['Experience'] < 0].index
df_NonNegExp.drop(indexes,inplace=True)
print("Dropped " + str(indexes.size) + " rows with negative Experience from the training data.")
# Create the datasets for the regression
X = df_NonNegExp[['Age']]
Y = df_NonNegExp[['Experience']]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=1)
# Perform the regression and output the coefficient and intercept and R^2 metrics
linreg_Experience = LinearRegression()
linreg_Experience.fit(X_train, Y_train)
for idx, col_name in enumerate(X_train.columns):
print("The coefficient for {} is {}".format(col_name, linreg_Experience.coef_[0][idx]))
intercept = linreg_Experience.intercept_[0]
print("Intercept: {}".format(intercept))
print("Training R^2: {}".format(linreg_Experience.score(X_train, Y_train)))
print("Test R^2: {}".format(linreg_Experience.score(X_test, Y_test)))
As expected given the extremely high correlation in the data, the R^2 results for train and test are both extremely high. The intercept is noteworthy however, as we know from the statistical analysis at 1.3 above that the min value for Age is 23, whereas eyeballing the values suggests that values for Age of 25 and below will produce a negative predicted value for Experience.
This is shown by applying the model to the min observed value for Age and checking the prediction.
predicted = round((linreg_Experience.predict(np.array([[df_LoanModelling['Age'].min()]]))).item(0))
print("Minimum observed value for Age in the dataset is {}".format(df_LoanModelling['Age'].min()))
print("Experience value predicted for this age is {}".format(predicted))
So, for values of Age where the model predicts a negative value, we will use zero.
# Create a dataframe to capture the results
df_ExpErrors = pd.DataFrame(columns=['Age','Experience','Predicted Experience'])
# Iterate the rows with negative experience, and use the regression model to predict the
# value of Experience based on Age, and impute this value or 0 if the prediction is negative
indexes = df_LoanModelling[df_LoanModelling['Experience'] < 0].index
i = 0
for index in indexes:
predict = round((linreg_Experience.predict(np.array([[df_LoanModelling.loc[index,'Age']]]))).item(0))
df_ExpErrors.loc[i] = [df_LoanModelling.loc[index,'Age'], df_LoanModelling.loc[index,'Experience'], predict]
i+=1
if predict < 0:
df_LoanModelling.loc[index,'Experience'] = 0
else:
df_LoanModelling.loc[index,'Experience'] = predict
# NOTE: It is understood that iteratively adding individual rows to a pandas dataframe is
# computationally intensive for larger operations, however as we've already established only
# 52 rows are involved this is not considered impactful.
print(df_ExpErrors.shape)
indexes = df_ExpErrors[df_ExpErrors['Predicted Experience'] > 0].index
print("The minimum Age with negative experience was {}".format(df_ExpErrors['Age'].min()))
print("The minimum Age with negative experience was {}".format(df_ExpErrors['Age'].max()))
print("The number of predictions greater than zero was {}".format(indexes.size))
df_ExpErrors['Age'].value_counts()
Finally, confirm that there are no more negative values for Experience in the dataframe.
any(df_LoanModelling['Experience'] < 0)
Interestingly, all the Experience errors occurred for Ages 23 to 29, and as the small number of predictions that were greater than zero shows, the errors cluster around ages 23 to 25 - this strongly suggests the introduction of errors into this dataset in the form of negative values for Experience is not completely random.
Although not part of my findings for this model, if this was in a business context I would be sharing this with the client, as it suggests some of the Experience values in this dataset may have already been imputed (as opposed to random errors in data entry).
df_LoanModelling.info()
Based on the analysis above and the incoming description of the data in the project brief, there are four changes that need to be made;
Firstly, the ID column is an index - the summary at 1.3 and 1.3.1 shows it contains 5000 unique values from 1 to 5000. This is neither categorical or continuous data, so should be dropped.
# The "ID" column appears to be an index - confirm by examining number of unique values vs shape
(df_LoanModelling['ID'].unique()).size
df_LoanModelling.drop(['ID'], axis=1, inplace=True)
df_LoanModelling.head()
Secondly, the ZIP Code column has been interpreted as continuous, whereas ZIP Codes are categorical variables for address/location. Additionally, the fields Personal Loan, Securities Account, CD Account, Online, and CreditCard are also categorical.
# Zip should be categorical
df_LoanModelling['ZIP Code'] = df_LoanModelling['ZIP Code'].astype('object')
df_LoanModelling['Family'] = df_LoanModelling['Family'].astype('object')
# Personal Loan, Securities Account, CD Account, Online, CreditCard should be categorical
df_LoanModelling['Personal Loan'] = df_LoanModelling['Personal Loan'].astype('category')
df_LoanModelling['Securities Account'] = df_LoanModelling['Securities Account'].astype('category')
df_LoanModelling['CD Account'] = df_LoanModelling['CD Account'].astype('category')
df_LoanModelling['Online'] = df_LoanModelling['Online'].astype('category')
df_LoanModelling['CreditCard'] = df_LoanModelling['CreditCard'].astype('category')
df_LoanModelling.info()
Finally, Education is actually a categorical variable as well - the values 1, 2, 3 are indexes, which refer to the descriptions of the person's education level, and are not numeric data. To make the data readable later, we'll convert Education to a categorical variable and substitute the values, then one hot encode the data before building the regression model.
df_LoanModelling['Education'] = df_LoanModelling['Education'].astype('object')
df_LoanModelling['Education'].replace(to_replace=1, value="Undergraduate", inplace=True)
df_LoanModelling['Education'].replace(to_replace=2, value="Graduate", inplace=True)
df_LoanModelling['Education'].replace(to_replace=3, value="Advanced/Professional", inplace=True)
df_LoanModelling['Education'].value_counts(normalize=True)
df_LoanModelling.info()
The datatypes now match the description of the data in the brief and the results of the analysis above. The continuous variables are Age, Experience, Income, CCAvg, and Mortgage, and the categorical variables are ZIP Code, Family, Education, Personal Loan, Securities Account, CD Account, Online, and CreditCard. We've also prepared Education for later encoding.
df_LoanModelling.nunique()
The high number of unique values for ZIP Code (467) suggest it may not be a useful categorical variable. The correlation matrix in the profile report at 1.3.1 (above) also suggests ZIP Code is not correlated with any of the other variables. One approach could be to bin the data to reduce the number of categories, possibly by binning adjacent zip codes, or resampling at a higher level (e.g. county). For the purposes of this analysis though, ZIP Code will be dropped.
df_LoanModelling.drop(labels=['ZIP Code'], axis=1, inplace=True)
indexes = df_LoanModelling[df_LoanModelling['Mortgage'] == 0].index
print("Number of customers with zero mortgage: {}".format(indexes.size))
print(str((indexes.size / df_LoanModelling['Mortgage'].size)*100) + "%")
69.2% of the customers in the dataset (3462 out of 5000) do not have a mortgage with Thera Bank.
indexes = df_LoanModelling[df_LoanModelling['CCAvg'] == 0].index
print("Number of customers with zero credit card spending: {}".format(indexes.size))
106 out of 5000 customers have zero credit card spending per month. This may be noteworthy later in the analysis of regression models, as Thera Bank's brief specifically nominates "liability customers" as the focus for the personal loan campaign. Customers with no credit card and no credit card spending would not ordinarily be defined as liability customers.
for i in list(df_LoanModelling.columns[df_LoanModelling.dtypes=='object']):
print(df_LoanModelling[i].value_counts())
print("\n")
for i in list(df_LoanModelling.columns[df_LoanModelling.dtypes=='category']):
print(df_LoanModelling[i].value_counts())
print("\n")
As noted in the analysis above, the dependent variable 'Personal Loan' is imbalanced, reflecting the information in the project brief that 9% of customers accepted a personal loan in the last campaign. Of note also are the 'Securities Account' and 'CD Account' columns, which appear to have a similar balance - it will be worth investigating if these are correlated in the bivariate analysis (below).
As per the outcomes of the analysis and preparation steps above, there are five continuous independent variables - Age, Experience, Income, CCAvg, and Mortgage.
plt.figure(figsize=(10,3))
sns.distplot(df_LoanModelling['Age']);
plt.figure(figsize=(10,3))
sns.distplot(df_LoanModelling['Experience']);
plt.figure(figsize=(10,3))
sns.distplot(df_LoanModelling['Income']);
plt.figure(figsize=(10,3))
sns.distplot(df_LoanModelling['CCAvg']);
plt.figure(figsize=(10,3))
sns.distplot(df_LoanModelling['Mortgage']);
'Age' and 'Experience' appear normally distributed. As per the analysis at 1.3 and 1.3.1 above, 'Income', 'CCAvg', and 'Mortgage' are all right skewed. We'll next examine box plots to see the outliers causing the skews.
plt.figure(figsize=(12,2))
sns.boxplot(x=df_LoanModelling.Age)
plt.figure(figsize=(12,2))
sns.boxplot(x=df_LoanModelling.Experience)
plt.figure(figsize=(12,2))
sns.boxplot(x=df_LoanModelling.Income)
plt.figure(figsize=(12,2))
sns.boxplot(x=df_LoanModelling.CCAvg)
plt.figure(figsize=(12,2))
sns.boxplot(x=df_LoanModelling.Mortgage)
The box plots show both the CCAvg and Mortgage columns are heavily right skewed, with a large number of outliers, with Income being less skewed. An observation of the skewness for the three skewed distributions shows this as well. An observation is that addressing skewness in the independent variables is one option for improving for regression model later, if we're unhappy with its performance.
skewness = pd.DataFrame({'Skewness' : [stats.skew(df_LoanModelling.Income),stats.skew(df_LoanModelling.CCAvg)
,stats.skew(df_LoanModelling.Mortgage)]},index=['Income','CCAvg','Mortgage'])
skewness
As per the outcomes of the analysis and preparation steps above, there are two independent categorical variables (Family and Education) and four independent boolean variables (Securities Account, CD Account, Online, and CreditCard). The dependent variable is Personal Loan.
plt.figure(figsize=(8,6))
sns.countplot(df_LoanModelling['Family'])
plt.figure(figsize=(8,6))
sns.countplot(df_LoanModelling['Education'])
plt.figure(figsize=(8,6))
sns.countplot(df_LoanModelling['Personal Loan'])
plt.figure(figsize=(8,6))
sns.countplot(df_LoanModelling['Securities Account'])
plt.figure(figsize=(8,6))
sns.countplot(df_LoanModelling['CD Account'])
plt.figure(figsize=(8,6))
sns.countplot(df_LoanModelling['Online'])
plt.figure(figsize=(8,6))
sns.countplot(df_LoanModelling['CreditCard'])
The imbalances noted at 2.4 (above) in the value counts are evident here, with Personal Loan, Securities Account, and CD Account having significantly more False observations than True. In the case of Personal Loan, a False observation means the customer did not take up a personal loan after the last campaign. In each of Securities Account and CD Account, a False observation means the customer does not have that type of account. The same balance is evident for CreditCard, although not to the same ratio. For the categorical variables, the plots show that most customers have a family of 1, and most customers have a Undergraduate level of education.
sns.pairplot(df_LoanModelling[['Age', 'Income', 'Mortgage', 'CCAvg', 'Personal Loan']], hue = 'Personal Loan')
The highly correlated and linear relationship between Age and Experience has been previously observed in the analysis at 1.3 and at 1.5.2 above in imputing values for the invalid negative values, so Experience is not plotted here as it does not add to the analysis. The distribution of the continuous variables has been discussed in the distribution plots above - however using Personal Loan as the hue divides the Income distribution, with higher income appearing to be more highly correlated with Personal Loan. This is also observed in the scatterplots for Age and Income, with the green plots for Personal Loan increasing as income increases.
The scatterplot for Income and CCAvg show an expected effect with rates of credit card usage increasing as income increases, however it is clear there are many high income customers who do not use credit cards. The same plot is evident for Income and Mortgage. The visualisation between Age and Mortgage, and Age and CCAvg shows no apparent pattern, reflecting the low correlations observed in the matrix at 1.3.1.
sns.countplot(df_LoanModelling['Education'], hue=df_LoanModelling['Personal Loan'])
We're interested in the rate of observations for the dependent variable for each independent categorical variable - this side by side countplot makes the rate difficult to compare, so these will be plotted using crosstab to normalise the observations.
The same visualisation above plotting using a crosstab is below.
dva = pd.crosstab(df_LoanModelling['Education'], df_LoanModelling['Personal Loan'])
ax = dva.div(dva.sum(1).astype(float), axis = 0).plot(kind='bar', stacked=True, figsize=(12,6))
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.show()
This visualisation suggests that people with higher levels of eduction accept personal loans at a higher rate than Undergraduates.
dva = pd.crosstab(df_LoanModelling['Family'], df_LoanModelling['Personal Loan'])
dva.div(dva.sum(1).astype(float), axis = 0).plot(kind='bar', stacked=True, figsize=(12,6))
This visualisation shows a slight increase in the rate of personal loan acceptance for customers with a family of 3, however the effect seems quite modest.
In terms of the boolean variables, as we have noted in 1.3.1 and 2.4 above there are similar scales of imbalance in CD Account and Securities Account - plotting these with Personal Loan will help visualise the relationship.
dva = pd.crosstab(df_LoanModelling['Securities Account'], df_LoanModelling['Personal Loan'])
dva.div(dva.sum(1).astype(float), axis = 0).plot(kind='bar', stacked=True, figsize=(12,6))
A customer having a securities account appears to not be correlated with them also accepting a personal loan.
dva = pd.crosstab(df_LoanModelling['CD Account'], df_LoanModelling['Personal Loan'])
dva.div(dva.sum(1).astype(float), axis = 0).plot(kind='bar', stacked=True, figsize=(12,6))
Customers who have a CD Account accepted personal loans at a higher rate than customers without CD Accounts in the last campaign.
dva = pd.crosstab(df_LoanModelling['Online'], df_LoanModelling['Personal Loan'])
dva.div(dva.sum(1).astype(float), axis = 0).plot(kind='bar', stacked=True, figsize=(12,6))
Whether a customer uses online banking services appears to make no difference to the rate at which they accepted personal loans.
dva = pd.crosstab(df_LoanModelling['CreditCard'], df_LoanModelling['Personal Loan'])
dva.div(dva.sum(1).astype(float), axis = 0).plot(kind='bar', stacked=True, figsize=(12,6))
The same effect for Online (above) is apparent for CreditCard.
df_LoanModelling.groupby('Personal Loan')['Income'].mean().plot(kind='bar', figsize=(10,5))
This graph plots the average income of customers who accepted a personal loan in the last campaign, against the average income of customers who did not accept a personal loan. This shows clearly the correlation noted at 1.3.1 above - customers who accepted a personal loan have a higher average income (over 140,000), than customers who did not accept a loan (approximately 65,000).
df_LoanModelling.groupby('Personal Loan')['CCAvg'].mean().plot(kind='bar', figsize=(10,5))
Similarly, this graph plots the average monthly credit card usage of customers who accepted a personal loan in the last campaign, against the average monthly credit card usage of customers who did not accept a personal loan. This shows the same correlation noted above - customers who accepted a personal loan have a higher average rate of monthly credit card usage. This may be significant given the client intends to target "liability customers".
X = df_LoanModelling.drop(labels=['Personal Loan'], axis=1)
Y = df_LoanModelling[['Personal Loan']]
X = pd.get_dummies(X,drop_first=True)
The data model has already been prepared in the analysis steps above, so all that is required now is to create the X and Y sets.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30,random_state=1)
print("{0:0.2f}% data is in the Training set".format((len(X_train)/len(df_LoanModelling.index)) * 100))
print("{0:0.2f}% data is in the Test set".format((len(X_test)/len(df_LoanModelling.index)) * 100))
Employing good practice to check the split of the data, confirms that the split is 70:30.
print("Original Personal Loan True Values : {0} ({1:0.2f}%)".format(len(df_LoanModelling.loc[df_LoanModelling['Personal Loan'] == 1]), (len(df_LoanModelling.loc[df_LoanModelling['Personal Loan'] == 1])/len(df_LoanModelling.index)) * 100))
print("Original Personal Loan False Values : {0} ({1:0.2f}%)".format(len(df_LoanModelling.loc[df_LoanModelling['Personal Loan'] == 0]), (len(df_LoanModelling.loc[df_LoanModelling['Personal Loan'] == 0])/len(df_LoanModelling.index)) * 100))
print("")
print("Training Personal Loan True Values : {0} ({1:0.2f}%)".format(len(Y_train.loc[Y_train['Personal Loan'] == 1]), (len(Y_train.loc[Y_train['Personal Loan'] == 1])/len(Y_train.index)) * 100))
print("Training Personal Loan False Values : {0} ({1:0.2f}%)".format(len(Y_train.loc[Y_train['Personal Loan'] == 0]), (len(Y_train.loc[Y_train['Personal Loan'] == 0])/len(Y_train.index)) * 100))
print("")
print("Testing Personal Loan True Values : {0} ({1:0.2f}%)".format(len(Y_test.loc[Y_test['Personal Loan'] == 1]), (len(Y_test.loc[Y_test['Personal Loan'] == 1])/len(Y_test.index)) * 100))
print("Testing Personal Loan False Values : {0} ({1:0.2f}%)".format(len(Y_test.loc[Y_test['Personal Loan'] == 0]), (len(Y_test.loc[Y_test['Personal Loan'] == 0])/len(Y_test.index)) * 100))
print("")
Checking that in the split, the balance of the dependent variable has remained consistent with the source data. The training and test sets show very similar balance to the source data set.
logit_PersonalLoan = sm.Logit(Y_train, sm.add_constant(X_train))
lg = logit_PersonalLoan.fit()
print(lg.summary())
The logit regression has produced a pseudo R^2 of 0.6565 which indicates that 65.65% of the uncertainty of the intercept model is explained. The regression result also notes that the data contains quasi-separation, which suggests there are some variables that separate the data with little overlap. The result of plots like Income and CCAvg with Personal Loan hue (above) show some candidates where there appears to be little overlap in the data, however as it is quasi-separation we will continue for now.
Noting we have established in the exploratory data analysis that wew have an imbalanced dependent variable, we will the _classweight='balanced' parameter for the regression model.
Firstly, create a dataframe to store the results for each regression model, so they are easier to visualise and compare.
df_ModelComparison = pd.DataFrame(columns=['Model', 'Train Score', 'Test Score', 'Recall', 'Precision', 'F1 Score', 'ROC AUC Score'])
Then perform the regression, and then the prediction.
logreg_PersonalLoan = LogisticRegression(random_state=42, class_weight='balanced')
logreg_PersonalLoan.fit(X_train, Y_train)
Y_predict = logreg_PersonalLoan.predict(X_test)
Define a function to draw a confusion matrix.
def draw_cm( actual, predicted ):
cm = confusion_matrix( actual, predicted)
sns.heatmap(cm, annot=True, fmt='.2f', xticklabels = [0,1] , yticklabels = [0,1] )
plt.ylabel('Observed')
plt.xlabel('Predicted')
plt.show()
print('Confusion Matrix')
print(draw_cm(Y_test,Y_predict))
train_score = round(logreg_PersonalLoan.score(X_train,Y_train),5)
test_score = round(logreg_PersonalLoan.score(X_test, Y_test), 5)
recall = round(recall_score(Y_test,Y_predict), 5)
precision = round(precision_score(Y_test,Y_predict), 5)
f1score = round(f1_score(Y_test,Y_predict), 5)
rocauc_score = round(roc_auc_score(Y_test,Y_predict), 5)
df_ModelComparison.loc[0] = ['1 - Default Balanced Class', train_score, test_score, recall, precision, f1score, rocauc_score]
pd.options.display.float_format = "{:.5f}".format
df_ModelComparison
To return briefly to the project objective, the classification goal is to predict the likelihood of a liability customer buying personal loans - this is to support the bank in its goal of converting liability customers to personal loan customers.
So, the business context for this model is critical, and this means the priority for this model is to reduce false negative predictions - that is, the model predicting that a customer will not accept a personal loan, whereas the observed result in the test data is that the customer did accept a personal loan. The cost to the bank for this error is the lost interest revenue from a customer who would have accepted a loan, whereas the cost to the bank of the model predicting a customer would accept a loan who then doesn't accept a loan is negligible - perhaps the cost of sending some correspondence to make the offer.
This means that Recall, which is sensitive to the rate of false negative predictions, is a key metric when considering the business problem being solved. In this first instance, the model has produced a high recall of 0.87248, with 130 of the 149 positive observations predicted correctly.
Within the context of logistic regression only, I have considered four strategies for how the model could perform better.
This routine creates an array of tuples of penalty and solver, then iterates the array using the tuple for the logistic regression parameters each time, and populates a dataframe with the results, which are sorted by Test Score then Train Score.
df_SolverTest = pd.DataFrame(columns=['Penalty','Solver','Train Score', 'Test Score'])
solvers = [('l2', 'newton-cg'), ('l2', 'lbfgs'), ('l2', 'liblinear'), ('l2', 'sag'), ('l2', 'saga'), ('l1', 'liblinear'), ('l1', 'saga')]
i = 0
for penalty, solver in solvers:
logreg_SolverTest = LogisticRegression(random_state=42, penalty=penalty, solver=solver, class_weight='balanced')
logreg_SolverTest.fit(X_train, Y_train)
df_SolverTest.loc[i] = [penalty, solver, round(logreg_SolverTest.score(X_train, Y_train),5), round(logreg_SolverTest.score(X_test, Y_test),5)]
i+=1
df_SolverTest.sort_values(by=['Test Score', 'Train Score'], ascending=False, inplace=True)
df_SolverTest.reset_index(drop=True,inplace=True)
df_SolverTest
The liblinear solver with l1 penalty produced the highest test and train score, but the model is still slightly overfit.
The next routine uses the outcome of the penalty and solver test, and then iterates an array of different values for C, and outputs the results to a dataframe, again sorted by Test and then Train score.
df_CTest = pd.DataFrame(columns=['C', 'Train Score', 'Test Score'])
penalty = df_SolverTest.loc[0,'Penalty']
solver = df_SolverTest.loc[0,'Solver']
C_parameters = [0.01,0.1,0.25,0.5,0.75,1]
i = 0
for C in C_parameters:
logreg_CTest = LogisticRegression(random_state=42, C=C, penalty=penalty, solver=solver, class_weight='balanced')
logreg_CTest.fit(X_train, Y_train)
df_CTest.loc[i] = [C, round(logreg_CTest.score(X_train,Y_train),5), round(logreg_CTest.score(X_test, Y_test),5)]
i+=1
df_CTest.sort_values(by=['Test Score', 'Train Score'], ascending=False, inplace=True)
df_CTest.reset_index(drop=True,inplace=True)
print("Penalty: {}".format(penalty))
print("Solver: {}".format(solver))
df_CTest
At a C value of 0.1, the highest Test score is observed - the model is also no longer overfit.
penalty = df_SolverTest.loc[0,'Penalty']
solver = df_SolverTest.loc[0,'Solver']
C = df_CTest.loc[0, 'C']
logreg_PersonalLoan_v2 = LogisticRegression(random_state=42, penalty=penalty, solver=solver, C=C, class_weight='balanced')
logreg_PersonalLoan_v2.fit(X_train, Y_train)
Y_predict = logreg_PersonalLoan_v2.predict(X_test)
print('Confusion Matrix')
print(draw_cm(Y_test,Y_predict))
train_score = round(logreg_PersonalLoan_v2.score(X_train,Y_train), 5)
test_score = round(logreg_PersonalLoan_v2.score(X_test, Y_test), 5)
recall = round(recall_score(Y_test,Y_predict), 5)
precision = round(precision_score(Y_test,Y_predict), 5)
f1score = round(f1_score(Y_test,Y_predict), 5)
rocauc_score = round(roc_auc_score(Y_test,Y_predict), 5)
df_ModelComparison.loc[1] = ['2 - Optimised Parameters', train_score, test_score, recall, precision, f1score, rocauc_score]
df_ModelComparison.sort_values(by=['Recall', 'ROC AUC Score'], ascending=False, inplace=True)
df_ModelComparison.reset_index(drop=True,inplace=True)
df_ModelComparison
Although optimising the parameters has increased the Precision and F1 Score, it produces a lower Recall and slightly lower AUC Score. The impact of the Recall performance is that 1 fewer positive loan responses were correctly predicted, so the default parameters actually performed better in the context of this business problem.
The next strategy is to use the regression model to predict probabilities, instead of predicting a value, and to then examine the effect of different thresholds at which the probability of a positive value (i.e. the acceptance of a loan) is used to impute the positive value.
This first routine uses predict_proba on the first regression model to output the probabilities to a dataframe, then iterates an array of thresholds at which which the probability imputes a positive value, and then calculates the ROC AUC Score against the test data. The resulting array is sorted by the AUC Score.
df_ThresholdTest = pd.DataFrame(columns=['Threshold', 'ROC AUC Score'])
Y_actual = pd.DataFrame()
Y_actual['Observed'] = Y_test['Personal Loan']
Y_actual.reset_index(drop=True,inplace=True)
Y_predict_proba = pd.concat([Y_actual, pd.DataFrame(logreg_PersonalLoan.predict_proba(X_test))], axis=1)
Y_predict_proba.rename(columns ={0:'Prob_0', 1:'Prob_1'}, inplace=True)
thresholds = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
i = 0
for threshold in thresholds:
Y_predict_proba['Predicted'] = Y_predict_proba.Prob_1.map( lambda x: 1 if x > threshold else 0)
df_ThresholdTest.loc[i] = [threshold, roc_auc_score(Y_test,Y_predict_proba['Predicted'])]
i+=1
df_ThresholdTest.sort_values(by=['ROC AUC Score'], ascending=False, inplace=True)
df_ThresholdTest.reset_index(drop=True,inplace=True)
df_ThresholdTest
A threshold of 0.4 produced an AUC score of 0.90126, so now create a prediction model using this threshold and output a confusion matrix.
threshold = df_ThresholdTest.loc[0, 'Threshold']
Y_predict_proba['Predicted'] = Y_predict_proba.Prob_1.map( lambda x: 1 if x > threshold else 0)
print('Confusion Matrix')
print(draw_cm(Y_test,Y_predict_proba['Predicted']))
train_score = round(logreg_PersonalLoan.score(X_train,Y_train), 5)
test_score = round(logreg_PersonalLoan.score(X_test, Y_test) ,5)
recall = round(recall_score(Y_test,Y_predict_proba['Predicted']), 5)
precision = round(precision_score(Y_test,Y_predict_proba['Predicted']), 5)
f1score = round(f1_score(Y_test,Y_predict_proba['Predicted']), 5)
rocauc_score = round(roc_auc_score(Y_test,Y_predict_proba['Predicted']), 5)
df_ModelComparison.loc[2] = ['3 - Probability Thresholds', train_score, test_score, recall, precision, f1score, rocauc_score]
df_ModelComparison.sort_values(by=['Recall', 'ROC AUC Score'], ascending=False, inplace=True)
df_ModelComparison.reset_index(drop=True,inplace=True)
df_ModelComparison
Using probabilities and a threshold of 0.4 to make the prediction has increased the Recall to 0.91946, with 137 of the 149 positive observations correctly predicted. The ROC AUC Score has also increased. This comes at the cost of a lower precision score as the low probability threshold (0.4) means the rate of false positives has increased. However, as noted in the observations under the default model above, in the business context for this model this means the bank would only miss 12 customers who would have taken up a loan, and there is no real cost to offering a loan to a customer who doesn't accept it. So, this model is preferred to the first model.
logit_PersonalLoan = sm.Logit(Y_train, sm.add_constant(X_train))
lg = logit_PersonalLoan.fit()
print(lg.summary())
Use the logit to produce a dataframe which lists the odds ratio and probability for the independent variables, filtered for a value of p <= 0.1.
lg_Coef = pd.DataFrame(lg.params, columns=['coef'])
lg_Coef.loc[:, "Odds_Ratio"] = np.exp(lg_Coef.coef)
lg_Coef['Probability'] = lg_Coef['Odds_Ratio']/(1+lg_Coef['Odds_Ratio'])
lg_Coef['pval']=lg.pvalues
pd.options.display.float_format = '{:.2f}'.format
lg_Coef = lg_Coef.sort_values(by="Odds_Ratio", ascending=False)
pval_filter = lg_Coef['pval']<=0.1
pd.options.display.float_format = "{:.2f}".format
lg_Coef[pval_filter]
Drop the excluded columns from the dataset.
X.drop(['Age','Experience','Mortgage','Family_2','Education_Graduate'],axis=1,inplace=True)
print(X.columns)
Resplit the data as X has changed, and redo the logit fit.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30,random_state=1)
logit = sm.Logit(Y_train, sm.add_constant(X_train))
lg = logit.fit()
print(lg.summary())
Note that the quasi-separation is still present, although as above the model is giving good predictions and the data is not completely separated, so we will continue.
Redo the regression with the default parameters using the new datasets with the irrelevant independent variables removed.
logreg_PersonalLoan_v3 = LogisticRegression(random_state=42, class_weight='balanced')
logreg_PersonalLoan_v3.fit(X_train, Y_train)
Y_predict = logreg_PersonalLoan_v3.predict(X_test)
print('Confusion Matrix')
print(draw_cm(Y_test,Y_predict))
train_score = round(logreg_PersonalLoan_v3.score(X_train,Y_train), 5)
test_score = round(logreg_PersonalLoan_v3.score(X_test, Y_test), 5)
recall = round(recall_score(Y_test,Y_predict), 5)
precision = round(precision_score(Y_test,Y_predict), 5)
f1score = round(f1_score(Y_test,Y_predict), 5)
rocauc_score = round(roc_auc_score(Y_test,Y_predict), 5)
df_ModelComparison.loc[3] = ['4 - Removed Irrelevant Variables', train_score, test_score, recall, precision, f1score, rocauc_score]
df_ModelComparison.sort_values(by=['Recall', 'ROC AUC Score'], ascending=False, inplace=True)
df_ModelComparison.reset_index(drop=True,inplace=True)
pd.options.display.float_format = "{:.5f}".format
df_ModelComparison
This model slightly outperforms the original model, but only in terms of a fractionally higher (0.8959 vs 0.8944) ROC AUC Score - the Recall scores are identical, with the model also predicting 130 of the 149 positive results. As such, the model using probability thresholds is still the preferred model.
Finally, it is noteworthy that the customer's objective pertains to "liability" customers - that is, the bank is not targeting customers who do not have a credit card and have zero monthly usage (I've excluded Mortgage customers as "liability" customers, since a mortgage is secured against the property). We also know we have an imbalanced dependent variable, so it's worth examining whether exluding non-liability customers from the data affects the imbalance and improves the prediction.
Firstly, make a copy of the main dataframe, and drop all the rows where there is no credit card AND no monthly spend.
df = df_LoanModelling.copy()
indexes = df[(df['CreditCard'] == 0) & (df['CCAvg'] == 0)].index
df.drop(indexes,inplace=True)
df.shape
This has dropped 78 rows from the data. So, recreate the X and Y data and resplit, and run the regression again with the default parameters and output the results.
X = df.drop(labels=['Personal Loan'], axis=1)
Y = df[['Personal Loan']]
X = pd.get_dummies(X,drop_first=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30,random_state=1)
logreg_PersonalLoan_v4 = LogisticRegression(random_state=42, class_weight='balanced')
logreg_PersonalLoan_v4.fit(X_train, Y_train)
Y_predict = logreg_PersonalLoan_v4.predict(X_test)
print('Confusion Matrix')
print(draw_cm(Y_test,Y_predict))
train_score = round(logreg_PersonalLoan_v4.score(X_train,Y_train), 5)
test_score = round(logreg_PersonalLoan_v4.score(X_test, Y_test), 5)
recall = round(recall_score(Y_test,Y_predict), 5)
precision = round(precision_score(Y_test,Y_predict), 5)
f1score = round(f1_score(Y_test,Y_predict), 5)
rocauc_score = round(roc_auc_score(Y_test,Y_predict), 5)
df_ModelComparison.loc[4] = ['5 - Liability Customers Only', train_score, test_score, recall, precision, f1score, rocauc_score]
df_ModelComparison.sort_values(by=['Recall', 'ROC AUC Score'], ascending=False, inplace=True)
df_ModelComparison.reset_index(drop=True,inplace=True)
df_ModelComparison
Removing the non-liability customers had no real effect on the model - a fractionally lower Recall and RUC AUC Score.
So, the strategy that gave the best result for improving the model was using probabilty thresholds - this improved Recall from 0.87248 to 0.91946, and improved the ROC AUC Score from 0.89442 to 0.90126.
Further strategies for improvement could include addressing the imbalance in the target variable using oversampling, or addressing the skewness in the independent variables by dropping outliers, or using a scaler - however, a Recall of 91.2% on the full dataset is a good result.
X = df_LoanModelling.drop(labels=['Personal Loan'], axis=1)
Y = df_LoanModelling[['Personal Loan']]
X = pd.get_dummies(X,drop_first=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30,random_state=1)
logreg_PersonalLoan = LogisticRegression(random_state=42, class_weight='balanced')
logreg_PersonalLoan.fit(X_train, Y_train)
Y_actual = pd.DataFrame()
Y_actual['Observed'] = Y_test['Personal Loan']
Y_actual.reset_index(drop=True,inplace=True)
Y_predict_proba = pd.concat([Y_actual, pd.DataFrame(logreg_PersonalLoan.predict_proba(X_test))], axis=1)
Y_predict_proba.rename(columns ={0:'Prob_0', 1:'Prob_1'}, inplace=True)
threshold = df_ThresholdTest.loc[0, 'Threshold']
Y_predict_proba['Predicted'] = Y_predict_proba.Prob_1.map( lambda x: 1 if x > threshold else 0)
print('Confusion Matrix')
print(draw_cm(Y_test,Y_predict_proba['Predicted']))
df_ModelComparison
logit_roc_auc = roc_auc_score(Y_test, Y_predict_proba['Predicted'])
fpr, tpr, thresholds = roc_curve(Y_test, Y_predict_proba['Predicted'])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Personal Loan acceptance')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()
So, the final model produces a logistic regression model using the default parameters, with the class_weight='balanced' parameter to address the imbalance in the target variable. The model is then used to predict probabilities, with any probability above 0.4 imputed to be positive.
The results of this model on the test data were;
In the context of the business objectives of the client, false negatives come at a higher cost than false positives. This is because a false negative costs the bank money in lost business (because a loan is not offered to someone who would accept it), whereas a false positive costs nothing (a customer simply declines the offer).
This means Recall is the key metric, and this model produces a Recall of 91.9% with 137 of 149 positive observations predicted. The model also has a high ROC AUC score of 0.90126.
In terms of the business outcome, this means that a campaign based on this model is likely to reach 91.9% of the customers who would accept a personal loan.